ABSTRACT


To develop a knowledge-aware recommender system, a key issue is how to obtain rich and structured 
knowledge base (KB) information for recommender system (RS) items. Existing data sets or methods either 
use side information from original RSs (containing very few kinds of useful information) or utilize a private 
KB. In this paper, we present KB4Rec v1.0, a data set linking KB information for RSs. It has linked three widely 
used RS data sets with two popular KBs, namely Freebase and YAGO. Based on our linked data set, we first 
perform qualitative analysis experiments, and then we discuss the effect of two important factors (i.e., 
popularity and recency) on whether an RS item can be linked to a KB entity. Finally, we compare several 
knowledge-aware recommendation algorithms on our linked data set. 


1. INTRODUCTION 


Recommender systems (RSs), which aim to match users with items they are interested in, play an 
important role in various online applications. Traditional recommendation algorithms mainly 
focus on learning effective preference models from historical user-item interaction data, e.g., matrix 
factorization [1]. With the rapid development of Web technologies, various kinds of side information have 
become available in RSs [2]. In early work, the context information used was usually unstructured, and 
its availability was limited to specific data domains or platforms. 


† Corresponding author: Wayne Xin Zhao (Email: batmanfly@gmail.com; ORCID: 0000-0002-8333-6196). 

In recent years, both the research and industry communities have made increasing efforts to structure 
world knowledge or domain facts in a variety of data domains. One of the most typical forms of organization 
is the knowledge base (KB) [3]. KBs provide a general and unified way to organize and associate information 
entities, which have been shown to be useful in many applications. For instance, KBs have been used in 
recommender systems, called knowledge-aware recommender systems [4]. To develop a knowledge-aware 
recommender system, a key issue is how to obtain rich and structured KB information for RS items. Overall, 
there are two main solutions from existing studies. First, side information has been collected from the RS 
platform and used as contextual features [5, 6, 7, 8, 9], and some studies further construct small and simple 
KB-like knowledge structures [10, 11, 12]. The number of attributes or relations is usually small, and much 
useful item information is likely to be missing. Second, several works propose to link RSs with private 
KBs [13, 14, 15]. The linkage results are not publicly available. We are also aware of some closely related 
studies [16, 17], which aim to link RS items with DBpedia entities. By comparison, our focus is on Freebase 
[18] and YAGO [19], which are now widely used in natural language processing (NLP) and related 
domains [20, 21, 22]. 


To address the need for a linked data set of RSs and KBs, we present a data set that links two public 
KBs with recommender systems, named KB4Rec v1.0, freely available at https://github.com/RUCDM/KB4Rec. 
Our basic idea is to heuristically link items from RSs with entities from public large-scale KBs①. 
On the RS side, we select three widely used data sets (i.e., MovieLens [5], LFM-1b [6] and Amazon book 
[7]) covering three different data domains, namely movie, music and book; on the KB side, we select the 
two well-known KBs (i.e., Freebase and YAGO). We try to maximize the applicability of our linked data set 
by selecting very popular RS data sets and KBs. We do not share the original data sets, since they are 
maintained by original researchers or publishers. These original copies are easily accessible online. 


In our KB4Rec v1.0 data set, we have organized the linkage results as linked ID pairs, each of which consists of 
an RS item ID and a KB entity ID. All the IDs are internal values from the original data sets. Once such a 
linkage has been established, existing large-scale KB data can be reused for RSs. For example, the 
movie “Avatar” from the MovieLens data set [5] has a corresponding entity entry in Freebase, and we are able 
to obtain its attribute information by retrieving all its associated relation triples in Freebase. Based on the 
linked data set, we first perform some qualitative analysis experiments, and then we discuss the effect of 
two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. 
Finally, we compare several knowledge-aware recommendation algorithms on our linked data set. 


With our linkage results and the original data copies, it is easy to develop an evaluation set for knowledge-aware 
recommendation algorithms. We believe such a data set is beneficial to the development of 
knowledge-aware recommender systems. 


① We use the terms “items” and “entities” for RSs and KBs, respectively. 


2. EXISTING DATA SETS AND METHODS 


In this section, we briefly review the related data sets and methods. 


Early knowledge-aware recommendation algorithms are also called context-aware recommendation 
algorithms, in which the side information from the original RS platform is considered context data. For 
example, social network information from the Epinions data set is utilized in [23, 24], POI property information 
from the Yelp data set in [11], movie attribute information from the MovieLens data set in [10], and 
user profile information from a microblogging data set in [25, 8]. These data sets usually 
contain very few kinds of side information, and the relations between different kinds of side information are 
ignored. 


To obtain more structured side information, Heterogeneous Information Networks (HINs) have been 
proposed as a technique for modeling complex connections between different types of objects [26]. In 
HINs, we can effectively learn underlying relation patterns (called meta-paths) and organize side information 
via meta-path-based representations. Examples of HIN-based recommendation methods include 
PER [10], HeteRecom [27] and MCRec [28]. HIN-based algorithms usually rely on graph search 
algorithms, which makes it difficult for them to find relation patterns at a large scale. 


More recently, KBs have become a popular kind of data resource for storing and organizing world knowledge 
or domain facts. Many studies have been carried out on the construction, inference and applications of 
KBs [3]. In particular, several pioneering studies [13, 14, 15] try to leverage existing KB information to 
improve recommendation performance. They apply a heuristic method for linking RS items with KB 
entities. However, these studies use a private KB for linkage, which is not accessible to the public. 


We are also aware of some closely related studies [16, 17], which aim to link RS items with DBpedia entities. 
In contrast, our focus is on Freebase and YAGO, which are now widely used in NLP and related 
domains [20, 21, 22]. In addition, our data sets contain more linked entities and more relations. 


3. LINKED DATA SET CONSTRUCTION 


In our work, we need to prepare two kinds of data sets, namely RS and KB. We first describe the original 
RS and KB data sets and then discuss the linkage method. 


3.1 RS Data Sets 


We consider three popular RS data sets for linkage, namely MovieLens, LFM-1b and Amazon book, which 
come from three different domains: movie, music and book, respectively. 


• MovieLens data set [7] describes users’ preferences on movies. A preference record takes the form 
<user, item, rating, timestamp>, indicating the rating score that a user gave a movie at some time. Four 
MovieLens data sets have been released, known as 100K, 1M, 10M and 20M, reflecting the 
approximate number of ratings in each data set. We select the largest, MovieLens 20M, for linkage. 

• LFM-1b data set [8] describes users’ interaction records on music. It provides information including 
artists, albums, tracks and users, as well as individual listening events. It records the listening events 
of a user on songs, but does not contain rating information. 

• Amazon book data set [9] describes users’ preferences on book products, with records of the form 
<user, item, rating, timestamp>. The data set is very sparse, containing 22 million ratings from 8 
million users across about 2.3 million items. 


The three data sets all provide several kinds of side information such as item titles (all), IMDB ID (movie), 
writer (book) and artist (music). We utilize such side information for subsequent KB linkage. 


3.2 KB Data Sets 


We adopt two large-scale public KBs, namely Freebase and YAGO. 


Freebase [18] is a KB that was announced by Metaweb Technologies, Inc. in 2007 and acquired by Google 
Inc. on July 16, 2010. Freebase stores facts as triples of the form <head, relation, tail>. Since Freebase shut 
down its services on August 31, 2016, we use its latest public version. 


YAGO [19] is a large semantic KB, which is automatically constructed based on the information of 
Wikipedia, WordNet, GeoNames and other data sources. It contains 447 million facts about 9.8 million 
entities in 10 different languages, with an accuracy of above 95% based on manual evaluation. In this 
paper, we use the version of YAGO in [29]. 


3.3 RS to KB Linkage 


With two KB data sets and three RS data sets, we can form six linkage results. Next, we describe the 
heuristic method for data linkage. 


All three RS data sets provide item titles. For Freebase, we use offline KB search APIs 
to retrieve KB entities with item titles as queries. Our heuristic linkage method follows an idea similar to that in 
[30]. If no KB entity with exactly the same title is returned, we say the RS item is rejected in the linkage 
process. If at least one KB entity with exactly the same title is returned, we further incorporate one kind 
of side information as a refining constraint for accurate linkage: IMDB ID, artist name and writer name are 
used for the three domains of movie, music and book, respectively. We found that only a small number 
(about 1,000 for each domain) of RS items can be neither accurately linked nor rejected via the above procedure, 
and we simply discard them. 
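
The two-step heuristic above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the data set: the `search_kb` callable stands in for the offline KB search API, and the record fields (`id`, `title`, `side_info`) are hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Optional


def link_item(item: dict, search_kb: Callable[[str], Iterable[dict]]) -> Optional[str]:
    """Return the linked KB entity ID, or None if the item is rejected or ambiguous.

    `item` and the KB candidates are assumed (hypothetically) to be dicts
    with the keys "id", "title" and "side_info".
    """
    # Step 1: keep only candidates whose title matches the item title exactly.
    candidates = [e for e in search_kb(item["title"]) if e["title"] == item["title"]]
    if not candidates:
        return None  # rejected: no KB entity with the same title

    # Step 2: refine with one kind of side information
    # (IMDB ID for movies, artist name for music, writer name for books).
    refined = [e for e in candidates if e.get("side_info") == item.get("side_info")]

    # Accept only an unambiguous match; every other case is discarded.
    return refined[0]["id"] if len(refined) == 1 else None
```

Requiring exactly one surviving candidate is a conservative choice made for this sketch; it trades coverage for precision, in the spirit of discarding items that can be neither accurately linked nor rejected.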


For YAGO, a KB entity is named in a similar way to its corresponding Wikipedia URL: the name 
is composed of the item title and related information such as type. For example, the film “Titanic” is marked 
as “<Titanic_(1997_film)>” in YAGO, and the corresponding Wikipedia page can be accessed through the 
link https://en.wikipedia.org/wiki/Titanic_(1997_film). Therefore, we first compare the titles of RS items with 
the prefixes of KB entity names. If at least one KB entity is returned, we leverage the “rdf:type” relation and the suffix 
(if available) to filter out entities from other domains. We find that most linkage decisions for the LFM-1b and 
Amazon book data sets can be determined accurately (either linked or non-linked) in this way. By comparison, 
there exist some ambiguous cases in the MovieLens 20M data set, which are further resolved through a 
year restriction. 
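
To make the prefix/suffix comparison concrete, the sketch below splits a YAGO-style entity name into its title prefix and parenthesized suffix and applies a year restriction for movies. The function names and the exact matching rules are assumptions for illustration; filtering by the “rdf:type” relation is not shown.

```python
import re
from typing import Optional, Tuple

# Matches names such as "<Titanic_(1997_film)>" or "<Titanic>".
_YAGO_NAME = re.compile(r"^<(?P<prefix>.+?)(?:_\((?P<suffix>[^()]*)\))?>$")


def split_yago_name(name: str) -> Tuple[str, Optional[str]]:
    """Split an entity name into (title prefix, suffix or None)."""
    m = _YAGO_NAME.match(name)
    if m is None:
        raise ValueError(f"not a YAGO-style entity name: {name!r}")
    prefix = m.group("prefix").replace("_", " ")
    suffix = m.group("suffix")
    return prefix, (suffix.replace("_", " ") if suffix is not None else None)


def matches_movie(name: str, title: str, year: Optional[int] = None) -> bool:
    """Check a movie title against an entity name (hypothetical rules)."""
    prefix, suffix = split_yago_name(name)
    if prefix.lower() != title.lower():
        return False
    if suffix is None:
        return True  # no suffix: domain filtering must rely on rdf:type instead
    if "film" not in suffix:
        return False  # suffix points to another domain, e.g. a novel or album
    # Year restriction: if the suffix carries a year, it must agree.
    found = re.search(r"\d{4}", suffix)
    return year is None or found is None or found.group() == str(year)
```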


During the linkage process, we have dealt with several issues that affect the results of string matching 
algorithms, e.g., letter case, abbreviations and the order of family/given names. Since the LFM-1b data set 
is extremely large, we remove all the music items with fewer than 10 listening events. Even after filtering, 
it still contains about 6.5 million music items. 
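
A small sketch of two of these preprocessing steps is given below: normalizing letter case and family/given name order before string matching, and dropping items with fewer than 10 listening events. The function names are hypothetical, the “Family, Given” convention is an assumption, and abbreviation handling is left out because the text does not specify how it was done.

```python
from collections import Counter
from typing import Iterable, Set


def normalize_person_name(name: str) -> str:
    """Lowercase a name and rewrite 'Family, Given' as 'given family'."""
    name = " ".join(name.lower().split())  # lowercase and collapse whitespace
    if "," in name:
        family, given = (part.strip() for part in name.split(",", 1))
        name = f"{given} {family}".strip()
    return name


def frequent_items(events: Iterable[str], min_events: int = 10) -> Set[str]:
    """Return the item IDs that occur in at least `min_events` listening events."""
    counts = Counter(events)
    return {item for item, n in counts.items() if n >= min_events}
```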


We present an illustrative example of our linkage results in Figure 1. In this example, there are two pairs, 
each consisting of an item from MovieLens 20M and its linked entity from Freebase. The two movie items are “Spider man” 
and “Spider man 2.” It is clear that the two movies share many common attributes in Freebase. With 
such linkage results, it is easy to obtain rich KB information about RS items, which is likely to be useful 
for improving recommendation performance. 


[Figure 1 image not reproduced. Recoverable labels: “Tobey Maguire,” “Spider man,” “Spider man 2,” “0145487,” “0316654.”] 


Figure 1. Linkage example of MovieLens 20M items with Freebase entities. Note: We highlight the MovieLens IDs 
and Freebase IDs. 


3.4 Basic Statistics 


We summarize the basic statistics of the three linked data sets in Table 1. It can be observed that for the 
MovieLens 20M data set, we have a very high linkage ratio: about 95.2% and 79.5% of the items can be accurately 
linked to an entity from Freebase and YAGO, respectively. For the other two domains, the linkage ratios are much lower, 
especially when using YAGO for linkage. The high linkage ratio of MovieLens 20M is probably 
because it contains far fewer items than the other two data sets, and its items have been refined by the original 
releasers. Besides, we speculate that there may be some domain bias in the construction of KBs. Overall, 
more RS items can be linked with Freebase entities than with YAGO entities. Although the linkage ratios for the latter 
two data sets are not high, the absolute numbers of linked items are large. We also report the number of 
overlapping linked entities for the two KBs in the last row of Table 1. We can see that there are also more 
linked items in the movie domain. Such a linked data set is suitable for research purposes. 


Table 1. Statistics of the linkage results. 


Data sets      Numbers          MovieLens 20M    LFM-1b           Amazon book 
RS data sets   #users           138,493          120,317          3,468,412 
               #items           27,279           6,479,700        2,330,066 
               #interactions    20,000,263       1,021,931,544    22,507,155 
Freebase       #linked-items    25,982           1,254,923        109,671 
               Linkage ratio    95.2%            19.4%            4.7% 
YAGO           #linked-items    21,688           49,608           17,607 
               Linkage ratio    79.5%            0.8%             0.8% 
Overlap        #overlap         21,221           26,126           7,398 


Note: The three domains correspond to the RS data sets of MovieLens 20M, LFM-1b and Amazon book, respectively. 
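
As a worked check of Table 1, the linkage ratio is the number of linked items divided by the number of RS items. The short sketch below recomputes the six ratios from the counts in the table; the variable and function names are illustrative.

```python
def linkage_ratio(linked: int, total: int) -> float:
    """Percentage of RS items that are linked to a KB entity."""
    return 100.0 * linked / total


# Counts copied from Table 1.
n_items = {"MovieLens 20M": 27_279, "LFM-1b": 6_479_700, "Amazon book": 2_330_066}
n_linked = {
    "Freebase": {"MovieLens 20M": 25_982, "LFM-1b": 1_254_923, "Amazon book": 109_671},
    "YAGO": {"MovieLens 20M": 21_688, "LFM-1b": 49_608, "Amazon book": 17_607},
}

for kb, linked in n_linked.items():
    for data_set, total in n_items.items():
        print(f"{kb:8s} {data_set:14s} {linkage_ratio(linked[data_set], total):5.1f}%")
```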


3.5 Shared Data Sets 


We name the above linked KB data set for recommender systems KB4Rec v1.0; it is freely available at 
https://github.com/RUCDM/KB4Rec. In our KB4Rec v1.0 data set, we organize the linkage results as 
linked ID pairs, each of which consists of an RS item ID and a KB entity ID. All the IDs are internal values from the 
original data sets. For Freebase, we have 25,982, 1,254,923 and 109,671 linked ID pairs for MovieLens 
20M, LFM-1b and Amazon book, respectively; for YAGO, we have 21,688, 49,608 and 17,607 linked ID 
pairs for MovieLens 20M, LFM-1b and Amazon book, respectively. 
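
The sketch below illustrates how linked ID pairs could be combined with KB triples to obtain the attributes of an RS item. It assumes, purely for illustration, that the pairs are stored as tab-separated lines of the form `rs_item_id<TAB>kb_entity_id`; the actual file layout in the repository may differ, and the IDs in the usage example are placeholders.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, TextIO, Tuple

Triple = Tuple[str, str, str]  # <head, relation, tail>


def load_id_pairs(fh: TextIO) -> Dict[str, str]:
    """Read 'rs_item_id<TAB>kb_entity_id' lines (assumed format) into a dict."""
    return {row[0]: row[1] for row in csv.reader(fh, delimiter="\t") if len(row) >= 2}


def index_by_head(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (relation, tail) pairs by head entity."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


def item_attributes(item_id: str, id_pairs: Dict[str, str],
                    index: Dict[str, List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """Return the KB attributes of an RS item, or [] if the item is not linked."""
    entity = id_pairs.get(item_id)
    return [] if entity is None else index.get(entity, [])
```

For instance, with `id_pairs = load_id_pairs(io.StringIO("42\tent_a\n"))` and an index built from `[("ent_a", "directed_by", "ent_b")]`, the call `item_attributes("42", id_pairs, index)` returns the single pair `("directed_by", "ent_b")`.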


4. LINKAGE ANALYSIS 


Previously, we have shown the linkage ratios for different data sets. We find that a considerable number 
of RS items cannot be linked to KB entities. It is interesting to study what factors affect the linkage 
ratio. We consider two factors for analysis. 


4.1 Effect of Popularity on Linkage 


Intuitively, a popular RS item should be more likely to be included in a KB than an unpopular item, since 
it is reasonable to incorporate more “important” RS items rated by the RS users into KBs. The construction 
of a KB usually involves manual effort, which makes it difficult to avoid the bias of human attention. To 
measure the popularity of an RS item, we adopt a simple frequency-based method by counting the number 
of users who have interacted with the item. This measure characterizes the attractiveness of an item to 
the users of an RS. First, we sort the items in ascending order of popularity. Then, we 
divide the items into five ordered bins with the same number of items in each bin. Hence, an item in a bin with 
a larger bin number is at least as popular as one in a bin with a smaller bin number. Finally, we compute the 
linkage ratio for each bin; the results are reported in Figure 2. It can be observed that a bin with a larger 
number has a higher linkage ratio than the ones with a smaller number. The results indicate that popularity 
is likely to have a positive effect on linkage. 

Figure 2. Examining the effect of popularity on the linkage results. Note: We use A, B, ... to indicate the bin 
number in an ordered way. The first three subfigures correspond to the popularity analysis on Freebase, and the last 
three subfigures correspond to the popularity analysis on YAGO. 
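
The binning procedure for the popularity analysis can be sketched as follows. The function name and the tie-breaking rule (sorting by item ID when popularity values are equal) are assumptions made so that the example is deterministic.

```python
from typing import Dict, List, Set


def linkage_ratio_by_bin(popularity: Dict[str, int], linked: Set[str],
                         n_bins: int = 5) -> List[float]:
    """Sort items by popularity, split them into equal-size bins,
    and return the fraction of linked items in each bin."""
    # Ascending popularity; ties are broken by item ID (an arbitrary choice).
    ordered = sorted(popularity, key=lambda item: (popularity[item], item))
    ratios = []
    for b in range(n_bins):
        start = b * len(ordered) // n_bins
        end = (b + 1) * len(ordered) // n_bins
        chunk = ordered[start:end]
        ratios.append(sum(i in linked for i in chunk) / len(chunk) if chunk else 0.0)
    return ratios
```

The same function applies to the recency analysis if the release date (for example, as an integer year) is passed in place of the popularity count and `n_bins` is set to 10.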


4.2 Effect of Recency on Linkage 


The second factor we consider is recency, i.e., the time when an RS item was created. Our assumption 
is that if an RS item was created or released at an earlier time, it is more likely to be included 
in KBs. Since the aggregation of human attention is a gradual process, an RS item usually requires a 
considerable amount of time to become popular. To check this assumption, we need to obtain the release 
dates of RS items. However, only the MovieLens 20M data set contains such attribute information, so we 
only report the analysis results on this data set. We first sort the items by release date in 
ascending order, and then divide the items equally into 10 ordered bins following the procedure of the 
above popularity analysis. Finally, we compute the linkage ratio for each bin. The results are reported in 
Figure 3. We can see that the linkage ratios gradually decrease as the release dates become more recent. The results indicate that 
recency is likely to have a negative effect on linkage, i.e., an older RS item seems more likely to 
be included in a KB than a more recent one. In Figure 3 (a), the last bin shows a dramatic drop, since our 
version of MovieLens dates from April 2015. 

Figure 3. Examining the effect of recency on the linkage results. Note: We use A, B, ... to indicate the bin number 
in an ordered way. The first subfigure corresponds to the recency analysis on Freebase, and the second subfigure 
corresponds to the recency analysis on YAGO. 


The above analysis indicates that both popularity and recency have a considerable effect on the final 
linkage results. However, the construction process of a KB is very complicated, and many other factors 
may affect this process. For future research, it is worth investigating what other important factors exist and 
how they affect the construction process of KBs. 


5. EXPERIMENT 


In this section, we present the comparison of some existing recommendation algorithms using our linked 
data sets. 


5.1 Experimental Setup 


Our purpose is to test whether the incorporated KB information is useful for improving recommendation 
performance. Since Freebase has more linked entities and associated relations, we only adopt the 
linked data set of Freebase for evaluation; the results from YAGO are similar and omitted here. 


The original linked data set is very large, so we first generate a small evaluation set for the following 
experiments. We take the subset from the last year for the LFM-1b data set and the subset from 2005 to 
2015 for the MovieLens 20M data set. We also perform 3-core filtering for the Amazon book data set and 10-core 
filtering for the other data sets. This part mainly follows the preprocessing step in [31]. We then keep only the 
items that are linked in our data set. We report the statistics of the data sets in Table 2. 


Table 2. Statistics of the evaluation data sets for the Freebase KB. 


Data sets        #users    #items    #interactions 
MovieLens 20M    61,583    19,533    5,868,015 
LFM-1b           7,694     30,658    203,975 
Amazon book      65,125    69,975    828,560 


Note: In this data set, all the items are linked with Freebase. 
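
The k-core filtering used in the preprocessing above is not defined in detail in the text. One common reading, assumed here, is to repeatedly remove users and items with fewer than k interactions until every remaining user and item has at least k. A minimal sketch of that reading:

```python
from collections import Counter
from typing import List, Tuple

Interaction = Tuple[str, str]  # (user ID, item ID)


def k_core(interactions: List[Interaction], k: int) -> List[Interaction]:
    """Iteratively drop users and items that have fewer than k interactions."""
    current = list(interactions)
    while True:
        user_counts = Counter(u for u, _ in current)
        item_counts = Counter(i for _, i in current)
        kept = [(u, i) for u, i in current
                if user_counts[u] >= k and item_counts[i] >= k]
        if len(kept) == len(current):
            return kept  # fixed point: nothing more to remove
        current = kept
```

The loop is needed because removing a sparse user can push an item below the threshold, and vice versa.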

Following [32], we consider the last-item recommendation task for evaluation. We set up such a task 
since it is a commonly used evaluation setting for RSs, and it makes it easy to compare different methods. Given 
a user, we first sort the items in ascending order of interaction timestamp, and then we put the last 
item into the test set and the rest into the training set. The goal is to predict the last item given the previous 
interaction sequence of a user. Since enumerating all the items as candidates is time-consuming, we pair 
each ground-truth item with 100 negative items to form a randomly ordered list. Each comparison method 
then returns a ranked list according to its recommendation confidence. To evaluate different methods, we 
adopt a variety of evaluation metrics, including the Mean Reciprocal Rank (MRR), Hit Ratio (HR) and 
Normalized Discounted Cumulative Gain (NDCG). 
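
With a single ground-truth item per user, the three metrics reduce to simple functions of the rank of that item in the candidate list. The sketch below uses the standard definitions; whether MRR is computed over the full candidate list or truncated at a cutoff is not stated in the text, so the untruncated form used here is an assumption, as are the function names.

```python
import math
from typing import Dict, List


def rank_of_target(scores: List[float], target_index: int) -> int:
    """1-based rank of the ground-truth item; ties are counted against it."""
    target = scores[target_index]
    return 1 + sum(1 for j, s in enumerate(scores)
                   if j != target_index and s >= target)


def metrics_at_k(ranks: List[int], k: int = 10) -> Dict[str, float]:
    """Average MRR, Hit@k and NDCG@k over users, one ground-truth item each."""
    n = len(ranks)
    return {
        "MRR": sum(1.0 / r for r in ranks) / n,
        f"Hit@{k}": sum(r <= k for r in ranks) / n,
        # With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
        f"NDCG@{k}": sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / n,
    }
```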


5.2 KB Information Representation 


Our focus is to provide rich KB information for recommender systems. A simple way is to represent KB 
information with a one-hot vector, which is sparse and high-dimensional. Here we borrow the idea in [15, 33] of embedding 
KB data into low-dimensional vectors. The learned embeddings are then used by the subsequent recommendation 
algorithms. To train TransE [33], we start with linked entities as seeds and expand the graph with a one-step 
search. As not all the relations in KBs are useful, we remove infrequent and general-purpose relations 
together with all their associated KB triples. After that, each linked item is associated with a learned KB 
embedding vector. We report the statistics for training TransE in Table 3. 


Table 3. Statistics of our subgraph for training TransE. 


Data sets        #entities    #relations 
MovieLens 20M    1,125,099    81 
LFM-1b           214,524      19 
Amazon book      313,956      49 


Note: #entities indicates the number of entities that are extended by seed entities with one-step search in Freebase. 
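
The subgraph construction described in Section 5.2 (one-step expansion from seed entities, then removal of infrequent and general-purpose relations) can be sketched as below. The frequency threshold and the set of excluded relations are hypothetical parameters; the text does not give the values that were used.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # <head, relation, tail>


def one_step_subgraph(triples: Iterable[Triple], seeds: Set[str],
                      min_relation_count: int = 2,
                      excluded_relations: FrozenSet[str] = frozenset()) -> List[Triple]:
    """Keep triples touching a seed entity, then drop rare or excluded relations."""
    # One-step search: a triple is kept if its head or tail is a seed entity.
    hop = [t for t in triples if t[0] in seeds or t[2] in seeds]
    # Relation frequencies are counted inside the expanded subgraph.
    counts = Counter(relation for _, relation, _ in hop)
    return [t for t in hop
            if counts[t[1]] >= min_relation_count and t[1] not in excluded_relations]
```

The resulting triples would then be passed to a TransE implementation to learn the entity and relation embeddings.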


5.3 Methods to Compare 


We consider the following methods for performance comparison②: 

• BPR [34]: It learns a matrix factorization model by minimizing the pairwise ranking loss in a Bayesian 
framework. 

• SVDFeature [35]: It is a model for feature-based collaborative filtering. In this paper we use the KB 
embeddings as context features to feed into SVDFeature. 

• mCKE [13]: CKE is the first method proposed to incorporate KB and other information to improve recommendation 
performance. For fairness, we implement a simplified version of CKE that uses only KB information 
and excludes image and text information. Different from the original CKE, we fix the KB representations 
and adopt the embeddings learned by TransE. 

• KSR [31]: It is a Knowledge-enhanced Sequential Recommender (KSR). It incorporates KB information 
to enhance the semantic representation of memory networks. 

② Here, since our purpose is to illustrate the use of this linked data set, we only select four methods for performance 
comparison. We will try more knowledge-aware recommendation algorithms in our future work. 
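
As a point of reference for the BPR baseline, the pairwise ranking loss for a single (user, positive item, negative item) triple is -ln σ(x_ui - x_uj), where x_ui is the predicted score of user u for item i. The pure-Python sketch below uses a dot-product (matrix factorization) scorer and omits regularization; the function names are illustrative and it is not the implementation used in the experiments.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bpr_loss(user: Sequence[float], pos_item: Sequence[float],
             neg_item: Sequence[float]) -> float:
    """-ln(sigmoid(x_ui - x_uj)) for one training triple, without regularization."""
    diff = dot(user, pos_item) - dot(user, neg_item)
    # -ln(sigmoid(d)) = ln(1 + exp(-d)), written in a numerically stable form.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss shrinks toward zero as the positive item is scored increasingly higher than the negative one, which is what drives the pairwise ranking.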


5.4 Results and Analysis 


The results of different methods for the last-item recommendation are presented in Table 4. We can 
see that: 

1) Among all the methods, BPR performs worst on the first two data sets, but very well on the Amazon 
book data set. A possible reason is that the first two data sets are relatively dense while the Amazon book 
data set is sparse. A lightweight method is likely to obtain a better performance than more complicated 
methods on a sparse data set. 

2) SVDFeature is implemented with a pairwise ranking loss function, and it can be roughly understood 
as an enhanced BPR model with the incorporation of the learned KB embeddings. Compared with 
BPR, SVDFeature is slightly better on the MovieLens 20M data set, substantially better on the LFM-1b 
data set, but worse on the Amazon book data set. In SVDFeature, each context feature introduces 
a number of parameters (depending on the number of dimensions). Hence, on a sparse data set, 
it may not work better than the simple BPR model. 

3) Next, we analyze the performance of the knowledge-aware recommendation methods, namely 
mCKE and KSR. Overall, mCKE does not work as well as expected, performing well only 
on the LFM-1b data set. A possible reason is that our implementation of mCKE fixes the learned KB 
embeddings, while the original CKE model adaptively updates KB embeddings. By comparison, 
the recently proposed KSR method consistently works best on the three data sets. KSR combines the 
capacity of Recurrent Neural Networks (RNNs) for modeling data sequences and the capacity of 
Memory Networks (MNs) for storing data in the long term. It further enhances MNs with the learned 
KB embeddings. 

Table 4. Performance comparison of different methods on the task of last-item recommendation. 


Data sets        Methods       MRR      Hit@10    NDCG@10 
MovieLens 20M    BPR           0.128    0.276     0.144 
                 SVDFeature    0.204    0.448     0.243 
                 mCKE          0.178    0.382     0.209 
                 KSR           0.294    0.571     0.344 
LFM-1b           BPR           0.227    0.458     0.265 
                 SVDFeature    0.337    0.544     0.373 
                 mCKE          0.371    0.541     0.399 
                 KSR           0.427    0.607     0.460 
Amazon book      BPR           0.222    0.505     0.272 
                 SVDFeature    0.264    0.544     0.315 
                 mCKE          0.248    0.494     0.291 
                 KSR           0.353    0.653     0.413 


6. CONCLUSION 


In this paper, we present KB4Rec v1.0, a data set linking KB information for recommender systems. It 
has linked three widely used RS data sets with the popular KBs Freebase [18] and YAGO [19]. Based on 
our linked data set, we first perform some qualitative analysis experiments, and then we discuss the effect 
of two important factors (i.e., popularity and recency) on whether an RS item can be linked to a KB entity. 
Finally, we compare several knowledge-aware recommendation algorithms on our linked data set. 


For future work, we will consider linking more RS data sets with KBs. We will also test the performance 
of more knowledge-aware recommendation algorithms on more recommendation tasks using the linked 
data set. 